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Abstract 

We consider the multiple breakpoint detection problem, which is concerned with detecting the loca¬ 
tions of several distinct changes in a one-dimensional noisy data series. We propose the breakpointError, 
a function that can be used to evaluate estimated breakpoint locations, given the known locations of true 
breakpoints. We discuss an application of the breakpointError for finding optimal penalties for breakpoint 
detection in simulated data. Finally, we show how to relax the breakpointError to obtain an annotation 
error function which can be used more readily in practice on real data. A fast C implementation of an 
algorithm that computes the breakpointError is available in an R package on R-Forge. 
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1 Introduction to segmentation models 


The goal of a segmentation model or algorithm is to divide a series of data into distinct segments. A major 
application of segmentation models is in detecting changes in copy number in cancer, using technologies such 
as array comparative genomic hybridization [Pinkel et al. 1998 . In these noisy biological data sets, the goal 
of segmentation is to detect the precise base pairs or genomic positions after which there are changes in copy 
number. 

How to evaluate the accuracy of a segmentation model? A new method for supervised segmentation of 
copy number data was proposed by Hocking et al. [2013 , who quantified the segmentation model accuracy 
using an annotation database containing visually-determined regions with or without breakpoints. This 
method depends critically on the definition of the visually-determined annotated region database, which is 
used to compute an annotation error function. 

This paper continues this line of research by defining the breakpointError function, which uses the true 
breakpoints to precisely compute the accuracy of a segmentation model. Also in this paper we demonstrate 
that the breakpointError is closely related to the annotation error, thus giving a theoretical foundation to 
the very practical new methods based on visually-determined annotated region databases. 

In this introduction, we first discuss a few motivating examples with figures. In Section 2 we discuss 
related work, and in Section 3 we define the breakpointError. In Section 4 we show an application of the 
breakpointError, and in Section 5 we discuss its relationship to the annotation error. In general we use bold 
to denote vectors (x, y^) and plain text to denote elements of those vectors {xi,y^) and scalars (p, ct^). 


1.1 Definition of breakpoints 

Assume there are P distinct positions in a series at which data could be gathered. Let V = {!,..., P} be 
the set of all such positions. For every position p G V, we assume there is some true probability distribution 
Dp. Let B = {1,..., P — 1} be all bases after which a breakpoint is possible. 

Definition 1. A breakpoint is any position p S B for which the next position does not have the same 
distribution: Dp yf Pp+i. 

For a series with P positions, there is a minimum of 0 breakpoints (Pi = • • • = Dp) and a maximum of 
P—1 breakpoints (Pi ^ ^ Dp). Note that the changes in distribution may be in mean, variance, or any 

other parameters that affect the distribution. 

The segmentation algorithm is given a sample of size d < P of data (pi, j/i),..., (pd, yd)i with positions 
Pi GV and noisy observations yi ~ Dp. for all samples i G {1,... ,d}. 
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For example, consider the normal distributions and simulated data shown in Figure The two panels 
show different, separate segmentation problems. The top panel shows a problem with two changes in mean, 
and the bottom panel shows two changes in variance. For both panels in the figure, there are P = 500 
distinct positions, d = 100 simulated samples, and two breakpoints: {300,400}. 



Figure 1: Two signals with the same breakpoints (vertical dashed lines) but different distributional changes. 
Black circles show d = 100 sampled data points drawn from normal distributions defined on positions 
V = {!,..., 500}. Horizontal line segments and shaded bands show mean ± one standard deviation of the 
true normal distributions Dp. The goal of segmentation is to recover the distributions and/or breakpoints, 
using only the sampled data points. 


1.2 Maximum likelihood segmentation algorithms 

A segmentation algorithm takes the d sampled data points as input, and returns a list of estimated dis¬ 
tributions and/or breakpoints. In this section, we will review one class of segmentation algorithms called 
maximum likelihood segmentation. 

A maximum likelihood segmenta tion model for multiple breakpoints in the mean of a normal distribution 


was proposed by 


Picard et al. 


2005 


. Let y = [ yi 


Vd 


be the vector formed by the d sampled 


data points, and let p = [ pi • • ■ Pd Y & be the corresponding vector of positions, ordered such that 


Pi <■■■< Pd- 
defined as 


Then for any number of segments k G (1,... ,d}, the estimated mean vector G 


y = arg mm 

xGR'^ 


l|y-x||2 


d-1 


subject to k — 1 = 1 


IS 


( 1 ) 






where ||x ||2 = is the squared £2 norm. Note that the optimization objective of minimizing the 

squar ed error is equivalent to maximizing the Gaussian likelihood with uniform variance [Picard et al. 


2005 


For a fixed A:„iax ^ d, we can quickly calculate for all k G {1,..., fc^ax} using pruned dynamic 
programming Rigaill [2010 . For any model size k, the estimated variance G K"*" is defined as the mean 
of the squared residuals: 


o-fe = 


- fc ||2 


/d. 


( 2 ) 
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The derivation is similar for the model of multiple breakpoints in the variance of a normal distribution 
Lavielle 2005 , and can be computed using the methods of Killick et al. |2012| or |Cleynen et al.||2014| . For 


both models, we visually represent the true distribution and estimates for k £ {2,..., 5} in Figure 



distribution 


■ truth 
estimate 


Figure 2: Comparing true and estimated distributions is one approach to segmentation model selec¬ 
tion/evaluation but is not the subject of this paper. Top 2 panels: the same reference/true signals as 
in Figure Others: estimated maximum likelihood models y^, for k € {2,..., 5} segments. 


1.3 Model selection 

The segmentation model selection problem may be posed as follows. Of the 4 estimated segmentation models 
k £ {2,... ,5}, which is the closest to the true model? 
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One method for segmentation model selection is to compare the true distribution with the estimated 
distributions (Figure |^, and choose the estimate whose distribution is closest to the true distribution. 
Assuming the true probability distributions Dp are available, one could compare them with the estimates 
using a distance function such as the earth mover’s distance [Rubner et al. 1997 , or some other distance 
function. However, the true distribution is not available in practice on real data, so in this paper we will not 
explore segmentation model selection via comparing distributions. 

Instead, we propose a method for comparing the true and estimated breakpoints. For any d-vectors of 
data and positions (x, p), we estimate the breakpoint locations using 


^(x, p) = { [{pj + pj+i)/2\ for all j e {1,..., d — 1} such that Xj ^ Xj+i}. (3) 

Thus for any model size fc, we estimate the breakpoint positions using (j){y^,p). In Figure]^ we compare 
these estimated breakpoints to the true set of breakpoints 

B = {jeM:D,^ (4) 



position 


breakpoints 

truth 

estimate 



Figure 3: Same as Figure but showing breakpoints instead of distributions. This paper proposes to 
compare the true and estimated breakpoints with the breakpointError function, which can be used for any 
reference/true distribution and any type of change. 
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Figure [^clearly shows 3 distinct types of errors that are possible in estimating the breakpoint positions: 

False negative (FN) for both data sets, the models with 2 segments are suboptimal because they only 
detect 1 of the 2 true breakpoints. 

False positive (FP) for both data sets, the models with 4 segments are suboptimal since they detect 3 
rather than 2 breakpoints. The models with 5 segments are even worse since they detect 4 breakpoints. 

Imprecision (I) of the two models with 3 segments, the breakpoints estimated for the change in variance 
data are more precise (closer to the true breakpoint positions). 

This paper proposes the breakpointError function (Figure]^, which can be used to quantify these intuitive 
observations. The breakpointError can be computed to quantify how well a set of estimated breakpoint 
positions matches a true or reference set of breakpoints. 


change in mean change in variance 



segments 

Figure 4: For the same data shown in Figure]^ we computed the breakpointError function (E) and its com¬ 
ponents. False positives (FP) occur when there are more estimated than true breakpoints, and false negatives 
(FN) are the opposite. For the correct model size (3 segments = 2 breakpoints), the imprecision function 
(I) quantifies the distance between the true and estimated breakpoint positions. The breakpointError is the 
sum of the other components (E=FP+FN+I). 


6 











2 Related Work 


This paper has been revised and expanded from Chapter 4 of the doctoral thesis of Hocking |2012 , which 
has not been previonsly published elsewhere. Differences include minor changes in notation, an expanded 
introduction, and more complete references. 

The main subject of this paper is the breakpointError (defined in Section]^, which is a function for 
precisely measuring the breakpoint detection accuracy of a segmentation model. There are several other 
approaches for evaluating segmentation models. Levy-Leduc and Harchaoui |2008| compared the number of 
detected breakpoints with the number of true breakpoints, ignoring the positions of the breakpoints. A more 
precise method was proposed by Pierre-Jean et al. |2014 , who checked if the detected breakpoints appear 
in regions of arbitrary size around the true breakpoints. In contrast, the breakpointError we propose in this 
paper has no arbitrary region size parameter. Bleakley and Vert [2011 used exact equality of the estimated 
and true breakpoint location in their asymptotic theoretical analysis. The breakpointError function is more 
precise since it is able to quantify that a guess close to a true breakpoint is better than a guess far from a 
true breakpoint. A final class of methods uses an annotated region database to quantify false positive and 
false negative breakpoint detections Hocking et al. 2013 Rigaill et al. 20131. An annotation database can 


be created by drawing regions on scatterplots of the data using a graphical user interface [Hocking et al. 


2014 . Evaluating a segmentation model via annotated regions is similar to the breakpointError function we 


propose in this paper, and the precise link between these methods will be explored in Section 

Section ^ shows one example application of the breakpointError function, for determining the optimal 
form of penalty functions in segmentation models for simulated data. Many related penalties have been 
proposed for the change-point detection problem. The standard AIC or BIC criteria are not well adapted 
in this context since the model collection is exponential [Birge and Mass^ 2007, Schwarz, 1978 Akaike 


1973 


2007 


Baraud et al. 2009 , and also because change-points are discrete parameters [Zhang and Siegmund 


Many criteria specifically adapted to change-point models have been proposed. For example, there 


are many different variants of the BIC Yao 

1988 

Lee 

1995 

Zhang and Siegmund 

2007|, and the model 

selection theory of Birge and Massart suggest other penalties 

Lavielle 

2005 

Lebarbier 2005 

Birge and 


Massart 2007 Arlot and Massart 2009]. The precise differences between these penalties and the penalties 


that we find will be discussed in Section [^ but the main difference is that the penalties discussed in this 
paper are specifically designed to minimize the breakpointError (rather than some other function, e.g. the 
squared error or negative log likelihood of the data). 
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3 Definition of the breakpointError 

Let us recall the notation of Section 1. Assume there are P distinct positions in a series at which data 
could be gathered. Depending on the desired application, these positions could be indices in a data vector, 
genomic positions, or time points. Let V = {!,... ,P} be the set of all such positions. For every position 
p G V, we assume there is some true probability distribution Dp. Let B = {1,..., P — 1} be all bases after 
which a breakpoint is possible, and let P = {p G B : Dp ^ Dp+i} be the set of true breakpoints. 

The segmentation algorithm is given a sample of size d < P of data (pi, yi ),..., {pd, yd), with positions 
Pi G V and noisy observations yi ^ Dp. for all samples i G {1,..., d}. The job of the segmentation algorithm 
is to return a breakpoint guess G C B. The object of this section is to define the breakpointError 
which quantifies the accuracy of the guess G with respect to the true breakpoints B. 

3.1 Desired properties of the breakpointError function 

We would like the breakpointError function • 2® —>■ K'*’ to satisfy the following properties: 

• (correctness) Guessing exactly right costs nothing: P^act(-®) = 0- 

• (precision) A guess closer to a real breakpoint is less costly: 

if P = {6} and 0 < i < j, then Pexact({^ + *}) < ^e^act({^ + J'l) and Pe^act({^ - d) < ^^e^act({^ - j})- 

• (FP) False positive breakpoints are bad: if & G P and g ^ B, then P^act({^}) < d})- 

• (FN) Undiscovered breakpoints are bad: b G B ^ Eexactii^}) < ^Scacti^)- 

In the next section we define the breakpointError, which satisfies all 4 properties. 



3.2 Definition of the breakpointError function 


In this section, we use the exact breakpoint locations B = {i?i,..., to define the breakpointError 
function. 

We define the error of a breakpoint location guess 5 € B as a function of the closest breakpoint in B. 
So first we put the breaks in order, by writing them as i3i < • • • < Bn- Then, we define a set of intervals 
Rb = {ri, • ■ • ,rra} that form a partition of B. For each breakpoint Bi we define the region = [r^,ri] G IB, 


where IB C 2* denotes the set of all intervals of ^ 
analysis literature [Nakao et al. 2010 


We take the notation conventions from the interval 


We define the upper limit of region i as 


ri 


P — 1 if i = n 

[{Bi+i + Bi)/2\ otherwise 


and the lower limit as 


1 if j = 1 

fi-i + 1 otherwise. 


The breakpoints Bi and regions are labeled for a small signal in Figure 


(5) 

( 6 ) 



14 9 10 14 


21 22 


base position 


Figure 5: For a small signal with 2 breakpoints, and for breakpoints i G {1, 2}, we plot the ii functions that 
measure the precision of a guess in The blue signal m G has 2 breakpoints: B = {4,14}. 

To emphasize the discrete nature of the data, N is drawn at each of the P = 22 distinct positions at which 
data could be gathered. 


Intuitively, if we observe a breakpoint guess g G ri, then its closest breakpoint is Bi. To define the best 
guess in each region, we use piecewise affine functions Cr^b,r : K —>■ [ 0 , 1 ] defined as follows: 


^r.b.r{9) — 


0 if g = b 

{b — g)/{x — r) if r < g < b 
{g — b)/{r — x) if b < g <f 
1 otherwise. 


For each breakpoint i we measure the precision of a guess g G B using 

^ii. 9 ) ~ ^r,,Bi,rii9) ■ 


(7) 


( 8 ) 
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These piecewise affine functions are shown in Figure [^for a small signal with 2 breakpoints. Note that there 
is some degree of arbitrary choice in the definition of the functions. For example, properly defined piecewise 


quadratic functions could also satisfy the precision property desired of the breakpointError (Section 3.11. 

Now, we are ready to define the exact breakpointError of a set of guesses C? C B. First, let C? n r be the 
subset of guesses G that fall in region r. 

Then, we define the false negative rate for region r as 


FN(G,r) = 


1 if G n r = 
0 otherwise 


and the false positive rate for region r as 


FP(G,r) = 


0 


if G n r = I 


|G n r| — 1 otherwise 
and the imprecision of the best guess in region r as 

0 if G n r = 


/(G,r,^) = 


miuggcnr otherwise. 


(9) 


( 10 ) 


( 11 ) 


When there are no breakpoints, we have B = % and Rb = %■ But we still would like to quantify the false 
positives, so let G \ ( U i?^) be the set of guesses G outside of the breakpoint regions Rb- 

Definition 2. The breakpointError of set of breakpoint guesses G with respect to the true breakpoints B 
is the sum of the False Positive, False Negative, and Imprecision functions: 

\B\ 

Eg,,,iG)=\G\iURB)\+J2 FP{G, Yi) + FN{G, Yi) + /(G, Y„Ii). 

i=l 

3.3 Implementation 

To compute the exact breakpointError, we first sort lists of n = \B\ and m = |G| items. Using the quicksort 
algorithm, this requires 0(n log n + mlogm) operations in the average case [Cormen et al. 1990 . Once 
sorted, the components of the cost can be computed in linear time 0(n + to). So, overall the computation 
of the error can be accomplished in best case 0(n + to), average case 0(n log n + to log to) operations. Its 
computation is implemented in efficient C code in the breakpointError R package on R-Forge, which can 
be installed in R using 


install.packages("breakpointError", repos="http;//r-forge.r-project.org") 
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4 Penalties with minimal breakpointError in simulations 

In this section, we show several examples of how to use the breakpointError function to determine penalties 
which minimize the train and test breakpointError in simulated data sets. In all cases, we will assume that 
there is a database of several piecewise constant signals with Gaussian noise. The goal is to learn a penalty 
constant that can be shared between signals with different properties. In each of the following sections, we 
will first present an empirical analysis of several simulated signals using the breakpointError. Then, we will 
discuss the relationship of our results to relevant theoretical results. 


4.1 Sampling density normalization 


The first problem we consider is finding a penalty that is invariant to sampling density. This is important 
because sampling density is often not uniform in real data sets. In fact, we see a sampling density between 
40 and 4400 kilobases per probe in the neuroblastoma data set of Hocking et al. 2013 . We would like to 


construct a single algorithm or penalty function that can be used for each of these segmentation problems. 

So to determine the form of the penalty function that can best adapt to this variation, we analyze the 
following simulation. We create a true piecewise constant signal m G over P = 70000 base pairs, with 
breakpoints every 10000 base pairs, shown as the blue line in Figure!^ Then, we define a signal sample size 
di € {70,..., 70000} for every noisy signal i € {!,..., z}. Let Ui € K'^be noisy signal i, sampled at positions 
Pi G with pii < ■ ■ ■ < Pi^di- We sample every probe j from the ^ N{mp.., 1) distribution. These 
samples are shown as the black points in Figure 

We would like to learn some model complexity parameter A on the first noisy signal, and use it for accurate 
breakpoint detection on the second noisy signal. In other words, we are looking for a model selection criterion 
which is invariant to sampling density. 



2.5 


0.0 


Q 

2.5 

I- 

Q. 

O 

o 2.5 

"D 

fo 0.0 


2.5 


5.0 


0 


20000 


40000 

position in base pairs 


60000 


Figure 6: Two noisy signals (black) sampled from a true piecewise constant signal (blue). Note that these 
are the same signals that appear in Figure [TOl 
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breakpointError BErri(A:) 


To attack this problem, we proceed as follows. For every signal i, we use pruned dynamic programming to 


calculate the maximum likelihood estimator yf G Q, for several model sizes k G /cmax} Rigaill 

2010 . Then, we define the model selection criteria 


fc“(A) = argminAfcd“ + ||yi -yf||i 

k 


( 12 ) 


Each of these is a function : IR+ —>■ {1,..., fcmax} that takes a model complexity tradeoff parameter A and 
returns the optimal number of segments for signal i. The goal is to find a penalty exponent a G K that lets 
us generalize A between different signals i. 

To quantify the accuracy of a segmentation for signal i, let BErri(fc) be the breakpointError of the model 
with k segments. This is a function BErr^ : {1,..., fcmax} —>■ R’*’, defined as 

BErr,(fc)=Ef_J</)(yf,p,)]- (13) 

where B is the set of real breakpoints in the true piecewise constant signal m. 

In Figure [Tj we plot BErr^ for the 2 simulated signals i shown previously. As expected, the model recovers 
more accurate breakpoints from the signal sampled at a higher density. 



Number of segments k in estimated maximum likelihood model 


Figure 7: Exact breakpoint error BErri(fc) for two signals i and several model sizes k. Note that these are 
the same error curves that appear in the Breakpoint panels of Figure 
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error Ef{\) 


Now, let us define the penalized model breakpoint error Ef : K+ —>■ K+ as 


i?“(A)=BErr,[fc“(A)]. (14) 

In Figure we plot these functions for the two signals i shown previously, and for several penalty exponents 

a. 

The dots in Figure show the optimal A found by minimizing the penalized model breakpoint detection 
error: 

A“ =argminF;“(A) (15) 

AGK+ 

Figure [^suggests that a « 1/2 defines a penalty with aligned error curves, which will result in A“ values 
that can be generalized between profiles. 



bases/probe 
o 7 
374 


-5 0 5 -5 0 5 -5 0 5 

model complexity tradeoff parameter logj^gC-^) 


Figure 8: Model selection error curves Ef{X) for 2 signals i and several exponents a. The penalty contains 
a term for the number of points sampled df. 
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error E°‘{\) 


Now, we are ready to define 2 quantities that will be able to help us choose an optimal penalty exponent 

a. 

First, let us consider the training error over the entire database: 

Z 

f;“(a) = ^f;“(a), (16) 

i=l 

and we define the minimal value of this function as 

E*{a) =mmE°{X). (17) 

In Figure]^ we plot these training error functions and their minimal values E* for several values of a. 
It is clear that the minimum training error is found for some penalty exponent a near 1/2, and we would 
like to find the precise a that results in the lowest possible minimum E*{a). 



-5 0 


-5 0 5 


-5 0 


model complexity tradeoff parameter log]^o("^) 


Figure 9: Training error functions E°‘ in black and their minimal values E*{a) in red. The penalty contains 
a term for the number of points sampled d“. 


14 














total error relative to true breakpoints (breakpointE] 


We also consider the test error over all pairs of signals when training on one and testing on another: 


TestErr(a) = ^ E“(A“). (18) 

In Figure we plot E* and TestErr for a grid of a values. It is clear that the optimal penalty is given 
by a = 1/2. This corresponds to the following model selection criterion which is invariant to the number of 
data points sampled di (for different simulated signals i with the same true breakpoints): 

ki{X) = a.Tg min Xk^i + \\yi-yi\\l (19) 

k 



penalty exponent a 


Figure 10: Train and test breakpoint detection error as a function of penalty exponent a. The penalty 
contains a term for the number of points sampled df. Mean error is drawn as a black line, with one standard 
deviation shown as a grey band. 
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As explained by Arlot and Celisse |2010 , a model selection procedure can be either efficient or consistent. 
An efficient procedure for model estimation accurately recovers the true piecewise constant signal, whereas 
a consistent procedure for model identification accurately recovers the breakpoints. Since we attempt to 
minimize the breakpointError, we are attempting to construct a consistent penalty, not an efficient penalty. 

In general terms, the fact that we find a nonzero exponent a for our cf“ penalty term agrees with other 
results. In particular, Arlot [2008 proposed an optimal procedure to select model complexity parameters in 
cross-validation by normalizing by the sample size di. 

The \/di term that we find here using simulations is in agreement with 


Fischer 


2011 


who used finite 


sample model selection theory to find a term in a penalty optimal for clustering. 

When theoretically deriving an efficient penalty for segmentation model estimation in the non-asymptotic 
setting, Lebarbier |2005 obtained a logd^ term. This contrasts our result, which attempts to find a consistent 
penalty, and uses the breakpointError to find a \/di penalty term. But in fact this is in agreement with 
classical results that the efficient AIC underpenalizes with respect to the consistent BIC, as shown in Tablej^ 


Efficient 

Penalty 

Consistent 

Penalty 

Model 

Term 

Model 

Term 

AIC 

2 

BIC 

\ogd. 

Lebarbier 

logd* 

This work 

di 


Table 1: Comparing our results with Lebarbier, in the context of classical results involving AIC and BIC. The 
BIC is designed for model identification and penalizes more than the AIC. Likewise, our penalty examines 
model identification using the breakpoint detection error, and penalizes more than the efficient penalty 
proposed by Lebarbier. 
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signal 


4.2 Signal length normalization 

In real array CGH data, we need to analyze chromosomes of varying length in base pairs. For example, 
human chromosome 1 is the largest at about 250 mega base pairs, and chromosome 22 is the smallest with 
only about 36 mega base pairs. But we expect that the number of breakpoints is proportional to the length 
of the chromosome in base pairs, and we would like to design a model selection criterion that is invariant to 
the signal length. 

So as a first step toward constructing a penalty that is invariant to the number of breakpoints, we consider 
the following simulation where we fix the number of points sampled at di = 2000, and vary the length of the 
signal sampled. In Figurewe show samples of 2 different lengths li, for the same true piecewise constant 
signal m. This simulation is somewhat unrealistic since the number of data points di in real data sets is 
usually proportional to the signal length li. We will consider a more realistic simulation model and a more 
complicated penalty in the next section. 



0 


20000 40000 60000 80000 

position in base pairs 


Figure 11: Samples of 2 different lengths li but constant number of points d = 2000. 
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For each signal i, we define the penalty 

fcf(A) = argminAfcZf + ||y* -yflla, (20) 

k 

where k is the length of the signal in base pairs. The goal will be to find a /? that can be used for signals of 
varying length. 

In Figure [T^ we show the breakpoint detection error curves for two signals and several penalty exponents 
j3. These curves seem to align when /3 = —1/2. 
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Figure 12: Breakpoint detection error curves for several penalty exponents /3 and 2 samples of varying length 
in base pairs k. The penalty contains a term . 
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total error relative to true breakpoints (breakpointError 


In Figure[^ we plot the train and test error curves over the entire set of simulated signals. These curves 
indicate minimal breakpoint detection error at (3 = —1/2, corresponding to the following penalty: 


k,{X) = argmin ^ + ||y, - 
k Vh 


|2 

l2- 


( 21 ) 



penalty exponent /3 


Figure 13: Train and test error curves for signals of different length in base pairs li. The penalty contains a 
term if. 

Interestingly, the Ijy/Ti term that we obtain here is in good agreement with our previous result that 
the optimal penalty for variable sampling density di should have a ^/di term. In particular, we can re¬ 
parameterize the problem to be in terms of the number of points sampled per segment pi = di/ki. In 
Section [4T| we he ld constant but in this section we hold di constant. In both cases we have a penalty with 
a = y/di/ki term. 

However, we do not know the number of segments ki in advance. But we supposed that the number of 
segments is proportional to the number of base pairs li, so we can use that in the penalty. This suggests 
a penalty that takes the form of ^/diJU. So in the next section, we confirm that this intuition works for 
constructing an optimal penalty. 
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4.3 Combining normalizations 

In this section, we show that we can combine the results of the previous sections to create composite invariant 
penalties. In particular, to normalize for sampling density di and length in base pairs li, we need and 
^iVh terms in the penalty, respectively. This suggests that when considering variable di and li, we need a 
•y/ di/li term in the penalty, and in this section we show that this intuitive construction results in an optimal 
penalty. 

In Figure [Tdl we plot 2 signals with different number of points di and length in base pairs li. In particular 
we tested di G {50,..., 1000} and li G {200,..., 1000}. We would like to find a penalty that allows us to 
generalize model complexity tradeoff parameters A between these signals. 

For each signal i, we define the penalty 

= argminAfcdi“Zf + ||yi - y/\\l, (22) 

k 

where h is the signal length in base pairs and di is the number of points sampled. We will attempt to 
determine a pair of a and /3 values that allow accurate breakpoint detection in signals of varying length and 
number of points sampled. Based on the results in Sections 
/3 = -l/2. 
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Figure 14: Two signals with a different number of points di and length in base pairs li. 
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4.1 and 4.2 we expect to find a = 1/2 and 
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alpha (exponent of number of data points) 


In Figure 15 we plot the train and test breakpoint error functions as a function of both a and /3. It is 
clear that the minimum is achieved by penalties near a = Il2,l3 = —1/2, which corresponds to a penalty of 


= argmin + l|y* - y*'"! 

k 


(23) 
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Figure 15: Train and test error functions for several signals of varying number of points di and length li. 
The penalty contains a term dfl^. Mean error values normalized to [0,1], minimum values indicated in red, 
and expected value a = 1/2,/? = —1/2 in white. 
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4.4 Optimal penalties for the fused lasso signal approximator 


In the previous sections, we used theoretical arguments and simulation experiments to determine the optimal 
penalties for maximum likelihood segmentation 0. In this section, we demonstrate that the same approach 
can be used to find optimal penalties for another model, the Fused Lasso Signal Approximator (FLSA). 

We used the f Isa function in version 1.03 of the flsa package from CRAN to calculate the FLSA [Hoefling' 


2009 


Let X e be the noisy copy number signal for one chromosome. The FLSA solves the following 


optimization problem: 


. 1 

arg mm - 
mGR'* 2 


d d d—1 

'^{Xj - rrijf + Ai ^ \mj\ + A 2 ^ 

i=i i=i 




(24) 


First, we take Ai = 0 since we are concerned with breakpoint detection, not signal sparsity. In this 
section, our aim is to determine a parameterization for A 2 that we will be able to find similar breakpoints in 
signals of varying sampling density. 

We use the same setup that we used to determine optimal penalties for maximum likelihood segmentation, 
as described in Section [4T| and shown again in Figure [THl 
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Figure 16: Simulated signals with different sampling density. 
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error Ef{\) 


In particular, for every signal j € {1,..., z}, let G be the noisy data, sampled at positions pi G . 
To find an optimal penalty for these data, first let A 2 = Ac?“. For each signal i, exponent a G K, and tradeoff 
parameter A G K"*", we define the optimal smoothing as 


=argmini||yi-m||^ + Ad“ ^ |m^ -TOj+i|. 


di-1 


(25) 


Then, we define the breakpoint detection error as a function of the breaks in the smoothed signal: 


(t> (yf’“,p*) 


(26) 


E?{\) = iffxa 
where the breakpoint function </) is defined in (§. 

We plot Ef for 2 signals i and several penalty exponents a in Figure 17 Note that the functions appear 
to align when a = 1. 
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Figure 17: Model complexity breakpoint error functions Ef. 
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total error relative to true breakpoints (breakpointE] 


To evaluate which penalty parameter a results in optimal fitting and learning, we computed train error 
E* and TestErr as defined in (161 and (18l. These functions are plotted in Figure 18 and suggest that a 
value of a = 1 is optimal. This analysis suggests that taking A 2 = \di is optimal for breakpoint detection 
using ELSA. This agrees with the observation of Hocking et al. 2013 that the flsa.norm penalty with a di 


term works better than the un-normalized flsa penalty. 

However, we obtained a different penalty (a = 0.5) in Section 4.1 for another model, maximum likelihood 
segmentation. These differences in optimal a values are due to the differences in how model complexity is 
measured in the two models. Maximum likelihood segmentation measures model complexity using the io 
pseudo-norm of the difference vector of m, whereas the FLSA uses the ^i-norm. 

We conclude by noting that this procedure could also be applied to find penalties for FLSA that depend 
on other signal properties such as length in base pairs k. However, we did not pursue this since FLSA does 
not work as well as maximum likelihood segmentation in practice on real data [Hocking et al. 2013 . 



penalty exponent a 


Figure 18: Train and test error as a function of penalty exponent a. The penalty has a term for the number 
of points sampled d“. 
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4.5 Applying the penalties to real data 


In Sections |4.1||4.2[ we found penalties with minimum breakpointError for simulated data with varying 
number of data points sampled di and length li in base positions (with li proportional to the number of 


breakpoints). In Section 4.3 we demonstrated that these results can be combined. We also found that the 
optimal penalty should include a term for the estimated variance sf 
the following penalty, for every signal i: 


Hocking 


2012 


These results suggested 


ki{X) = arg min A/cs ■ + l|y* “ fil 

k 


(27) 


In Table we report results of using the suggested penalties on the neuroblastoma data set described 

by Hocking et al. |2013 . The cghseg.k penalty which was found to be the best by Hocking et al. |2013 


has a term for number of data points sampled di (no square root) but no terms for length li nor estimated 
variance Si. The penalty terms suggested in this section do not improve breakpoint detection error in the 
neuroblastoma data set. This observation suggests that distribution that generates the real data is more 
complex than the simple simulation model considered in this paper. 



points 

length 

variance 

train 

test.mean 

test.sd 

cghseg.k 

1 

0 

0 

2.19 

2.20 

1.01 

cghseg.k.var 

1 

0 

2 

2.46 

2.73 

1.98 

cghseg.k.sqrt.d 

1/2 

0 

0 

3.51 

3.87 

1.58 

cghseg.k.sqrt 

1/2 

-1/2 

0 

4.30 

6.11 

5.02 

cghseg.k.sqrt.d. var 

1/2 

0 

2 

3.19 

4.47 

5.02 

cghseg.k.sqrt. var 

1/2 

-1/2 

2 

4.18 

6.38 

7.61 


Table 2: Breakpoint detection error of several penalties on the neuroblastoma data set, with 1 row for each 
penalty. The exponent of the number of data points di, length li, and variance Si terms in the penalty is 
shown with the train and test annotation error (percent incorrect regions). 


Practically speaking, we still would like to find a penalty with optimal breakpoint detection for any par¬ 
ticular real data set such as the neuroblastoma data. Rigaill et al. |2013 achieved state-of-the-art breakpoint 
detection in the neuroblastoma data set by learning the penalty constants using a training data set of man¬ 
ually annotated regions. For the rest of this paper, we will discuss the relationship of the breakpointError 
to these annotation-guided methods. 
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5 Annotation error functions for real data sets 


In this section, we define several annotation error functions which can be used in real data sets (Table |^. In 
real data, we do not have access to the true piecewise constant signal m, nor the underlying set of breakpoints 
B. So the breakpointError defined in the Section 3 is not readily computable. We will first show how in real 
data, we can compute another function called the incomplete annotation error. Then, we will demonstrate 
its relationship to the breakpointError using the complete annotation error function. 


Section 

“““ 

O 


Error function 
breakpointError 
incomplete annotation error 
complete annotation error 
01 annotation error 


Symbol 


-^exact 

TpA 

'^incomplete 

jpA 

■^complete 




Need counts incorrect 

true breakpoints B guesses 

some annotations A guesses 

all breakpoint annotations A guesses 

some annotations A regions 


Table 3: Several breakpoint detection error functions, and how much prior knowledge is needed to compute 
each. The breakpointError needs the most prior knowledge and can only be used in simulations when the 
true breakpoints B are known. In contrast, the incomplete/01 annotation error can be used in real data sets 
by using visual inspection of scatterplots to create annotations A. 


5.1 Incomplete annotation error for real data 

By plotting a real data set, we can easily identify regions that contain breakpoints by visual inspection, as 
shown in Figure |19| 



position in base pairs 


annotation 

1 breakpoint 
Obreakpoints 

imprecision 

I — breakpoint 
I — annotation 



Figure 19: Top: simulated noisy signals (black) with their true piecewise constant signals m (blue) and 
visually-determined breakpoint annotations A (red). Middle: negative annotations A^ constructed using 
(311. Bottom: breakpoint detection imprecision curves for the breakpointError £i 0 and the annotation 
error li (331. 


Recall that there are P distinct positions in a series at which data could be gathered, and that B = 
{1,..., P — 1} is the set of all positions after which a breakpoint is possible. 
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Definition 3. A set of n annotations can be written as A = {(ai, ri),..., (a„, r„)}. For each annotation i, 
Ti C IB js an interval that defines the region, and Oi C {0, 1,..is an interval of allowable breakpoint counts 
in this region. 

For example, consider the annotated regions in Table 

i Allowed breakpoints at Region 


1 {0} [5,10] 

2 {1} [20,30] 

3 {1,2,...} [40,70] 

4 {0} [80,100] 


Table 4: Sample annotated regions for a signal sampled on P = 100 base pairs. An annotation Oi indicates 
how many breakpoints are allowed in the corresponding region r^. There are 0 breaks in bases 5-10 and 
80-100. There is exactly 1 break in bases 20-30. There is at least 1 break in bases 40-70. 


Given a set of breakpoint guesses G C B, we define the annotation-dependent false positive count as 


F"P(G, r, a) = (|G n r| — max(a))_|_ 
where the positive part function is defined as 


x+ 


X if a: > 0 
0 otherwise. 


(28) 


(29) 


Similarly, the annotation-dependent false negative count is defined as 

FN(G, r, o) = (min(a) — |G n r|)^. (30) 

Definition 4. Let A be a set of annotations and G C B a set of breakpoint guesses. The incomplete 
annotation error is the count of annotation-dependent false positives and false negatives: 


E. 


A 

incomplete 


(G) = ^ FP{G, r„ a,) + FW(G, r„ a,). 


In the case of analyzing the simulated signals in the top panels of Figure let us consider the set of 6 
annotations A = {(oi, fi),..., (oe, fg)} depicted using the red rectangles. These rectangles were determined 
by visual inspection of the scatterplots. I used the SegAnnDB interactive annotation web site to view the 
data and save a database of 6 regions per profile [Hocking et al.[ 2014] . Every region contains exactly 1 
breakpoint, so we have di = {1} for every annotation i € {1,..., 6}. In real data we will probably only be 
able to see a subset of the real breakpoints, but we analyze the complete set of breakpoints in these simulated 
data to illustrate the approximation induced by the annotation process. 

Given any set of non-intersecting annotations A, we can write Zii < ' ’ ’ < Hn order the regions. Then 
we can define |A| -|- 1 negative annotations as 


A° = {(0,[l,ri-l]), (0,[ri + l,r 2 -l]), ..., (0, [r„_i + 1, r„ - 1]), (0, [r„ + 1, d - 1])}, (31) 

as drawn with yellow rectangles in the middle panels of Figure]^ We will use the complete set of annotations 
A U A° to define the annotation error £’;'^[{j^p[g^g(G) for breakpoint guesses G given by models of these 
simulated signals. 
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In Figure we plot some model selection error functions for the 2 simulated signals shown in Figure [T^ 
It is clear that the annotation error is a good approximation of the breakpointError, and there are several 
interesting observations to note. 


• Signal: in these simulated data, the true piecewise constant signal m is available, so an efficient model 
selection procedure [Arlot and Celiss^ |2010| would select the estimated model which is closest to the 
true signal. That idea is illustrated in Figure and can be used in this context by minimizing 


Asigna^fc) — lo§10 




2=1 


(32) 


— In Figure 20 for the signal sampled at 7 bases/probe, the minimum of the Signal error identifies 
a model with 7 segments. 

— For the signal sampled at 374 bases/probe, the minimum of the error identifies a model with only 
5 segments. 


• Breakpoint: in these simulated data, the true breakpoints B are available, so we can compute and 
minimize the breakpointError as a consistent model selection procedure [Arlot and Celiss^ 2010j . For 
both signals, the minimum of the breakpointError identifies a model with 7 segments (6 breakpoints). 


• Annotation: we use a set of annotated regions to compute the incomplete annotation error, which also 
identifies a model with 7 segments. It is clear that the annotation error is a good approximation of the 
breakpointError. In the next section, we explicitly demonstrate the link between the breakpointError 
and the annotation error. 



Figure 20: Model selection error curves for 2 simulated signals. Minima are highlighted using circles. 
Signal: the log squared error Agignai of the estimated signal with respect to the true piecewise constant 
signal (see text). 

Breakpoint: exact breakpointError 

Annotation: incomplete annotation error . 
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5.2 Link with breakpointError using complete annotation error 

It is clear from Figure that the annotation error is a good approximation of the exact breakpoint error 
when the annotations A agree with the real breakpoints B. In this section, we make this intuition precise 
by showing exactly how to relax the breakpointError to obtain the annotation error. There are two steps: 

1. We define the complete annotation error by relaxing the definition of the exact breakpointError. 

2. We show that the complete annotation error is equivalent to the incomplete annotation error when we 
have a complete set of annotations. 

We will define the complete annotation error as a relaxation of the exact breakpointError. Recall from 
Definition!^ the exact breakpointError is 

|S| 

i^exact(G) = |G \ (URb)| + 5] FP(G, rO + FN(G, r,) + I(G, r„ G). 

i=l 

To define the complete annotation error, we perform two relaxations: 

• Instead of using the true breakpoints B (which are unknown in real data) with equations and (§ 
to determine a breakpoint region r^, we use the region determined by visual inspection. 

• Rather than the piecewise affine imprecision £i, we use the zero-one imprecision 

1(g) = (33) 

We show this relaxation by ploting the imprecision functions £i and G in the bottom panels of Figure [l^ 

Definition 5. Assume there are n breakpoints Bi < • • • < i?„, and we observe a set of annotations 
A = {(oi, ri),..., (a„, f„)} eaeh with Oi = 1 breakpoint, sueh that Bi S fi,...,R„ € f„. The complete 
annotation error of a set of breakpoint guesses G is the sum of false positive and false negative eounts: 


^complete 


(G) 


G\(ui) 


| 4 | 

+ ^ FP(G, f,) + FNiG, f,) + J(G, f„ £,) 


G\(ui) 


^ FP(G,v) + FNiG,v). 

{a,f)GA 


It is clear that E^ompiete depends on the annotations only through their regions. In particular, the 
annotated breakpoint counts Oi = {1} are not used in this definition, since we assumed that each region 
contains exactly 1 breakpoint. Also, since we used the zero-one imprecision for £i, the imprecision function 
I is always zero. 


Proposition 1. Let A^ be a set of negative annotations as in (31). Then for a set of breakpoint guesses G, 
the incomplete and complete annotation error functions are equivalent: 


pAuA° 

^incomplete 


(G) — EfompleteiG)- 


Proof. To see the connection between the complete and incomplete annotation error functions, first note 
that 


FN(G,r,{l}) 


(l-|Gnr|)^ 

FN(G,r), 
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(34) 







and 


FP(G,r,{l}) = (|Gnr|-l)^ 

= FP(G,r). 


( 35 ) 


For the complete annotation error we quantified the false positive rate of the breakpoints that fall outside 
of the breakpoint regions A using G \ (u4). For the incomplete annotation error, we instead created a set 
of 0-breakpoint annotations for this purpose. Note that by construction of the negative regions in (311, 
we have 

G \ (ui) = G n (ui°) , (36) 

or in words, the guesses outside of the breakpoint annotations A are in the negative annotations A^. So 
using (36), we have 


FP(G,(ui°),{0}) = |Gn(ui°)| 
= |G\(ui)|, 


(37) 


which is the first component of the complete annotation error. 

Recall that A represents annotated regions that each contain exactly 1 breakpoint, and A^ are regions 
with no breakpoints. So using (34), (351, and (37), we have that the incomplete annotation error is equivalent 
to the complete error: 


T7'^ 


AuA° 


incomplete 


(G) = E FP(G,r,a)-k ^ FP(G, r, a)-k FN(G, r, a) 

(a,r)GAO (a,r)GA 

= FP(G,ui°,{0})+ ^ FP(G,r,{l})+FN(G,r,{l}) 

(a,r)GA 

= |G\(ui)|+ ^ FP(G,r)+FN(G,r) 

(a,r)GA 


= E. 


complete 


(G). 


(38) 

□ 


So in fact the incomplete annotation error is equivalent to the complete error when the annotated regions 
A each contain exactly 1 breakpoint. But we call this the incomplete error since it is also well-defined for 
arbitrary sets of regions A. 
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5.3 Zero-one annotation error 


The incomplete annotation error counts incorrect breakpoints. In this section, we show that by thresholding 
the incomplete annotation error, we can obtain the zero-one annotation error function. This is the original 
annotation error function that was introduced by Hocking et al. 2013 , who used it to count the number of 
incorrect regions. 


First, let us define the zero-one thresholding function t : —>■ Z+ as 

— la;^0 — 


1 if X 0 
0 otherwise. 


(39) 


The idea of thresholding is to limit the error that any one annotation can induce. We define the zero-one 
annotation error as 


-l- t 


FN(G,r,a) 


<(G) = E qFP(G,r,a) 

(a,r)GA 

— ^ ^ l|Gnr|>max(a} l|Gnr|<min(a) 

{a,r)GA 

= l|Gnr|ya- 

(a,r)GA 


(40) 


So using the zero-one annotation error, we count incorrect annotated regions instead of incorrect breakpoint 
guesses. 
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5.4 Comparing annotation error functions 

In practice, we have few annotated regions per signal in real data. In Figure !^ we show how the annotation 
error is degraded as we remove annotations. In particular, it is clear that using the thresholded zero-one 
annotation error significantly degrades the approximation of the FP curve. Nevertheless, it is worth noting 
that minimum of the zero-one error still uniquely identifies the correct model with 7 segments. Even after 
removing many annotations, the minimum error still identifies the correct model, but not uniquely. 



segments k of estimated signal 


Figure 21: Comparison of annotation error functions as the set of annotations changes. Minima are high¬ 
lighted using circles. 

Complete: annotation error for a complete set of 6 positive and 7 negative annotations. 

Zero-one: zero-one annotation error for a complete set of 6 positive and 7 negative annotations. 
Incomplete: zero-one annotation error for 3 positive and 4 negative annotations. 

Positive: zero-one annotation error for 3 positive annotations. 


In conclusion, this section has discussed the connections between the breakpointError and the annotation 
error functions. Whereas the breakpointError is computable only when the true set of breakpoints is known 
(e.g. simulated data), the annotation error is readily computable in any data set using a set of visually 
determined annotations. We showed that if the annotations are consistent with the true breakpoints, then 
the annotation error function is a good approximation of the breakpointError (Figure 201. Finally, we 
observed that even after thresholding and removing annotations, the annotation error function can still be 
used to identify a set of minimum error segmentation models (Figure [M]). 
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6 Conclusions and future work 


In this paper we defined the breakpointError, which can be used to quantify the breakpoint detection 
accuracy of a segmentation model, when the true breakpoint positions are known. In Section 4 we showed 
one application of the breakpointError for determining optimal penalty constants in several simulated data 
sets. In Section 5 we discussed the relationship of the breakpointError to the annotation error, which has 
been used for supervised segmentation of real data sets [Hocking et al. 2013 Rigaill et al., 2013 Hocking 


et al. 2014 . We showed that the annotation error is a good approximation of the breakpointError when the 


annotated regions agree with the true breakpoints. This provides some justification for using the annotation 
error in supervised analysis of real data sets. 

For future work, it will be interesting to apply the breakpointError to more realistic tasks. For example, 
[2014 proposed to evaluate breakpoint detection algorithms by adding breakpoints and 


Pierre-Jean et al. 


noise to real data sets. In their framework, the true breakpoint positions are known, and a region around 
each breakpoint is used to quantify the number of true and false positive breakpoint detections. Instead of 
using the zero-one loss with an arbitrarily sized region, the breakpointError could be used to more precisely 


quantify breakpoint estimates, since it counts imprecision (111 in addition to false positive and false negative 
breakpoint detections. 

To facilitate the use of the breakpointError in future work, it is implemented in the R package breakpointError 
on R-Forge. It can be installed in R using 


install.packages("breakpointError", repos="http;//r-forge.r-project.org") 


Acknowledgements: Thanks to Marco Cuturi for references about distance functions for comparing 
probability distributions. Thanks to Guillem Rigaill for helpful comments on a preliminary version of this 
paper. 
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